Latent evolutionary signatures: a general framework for analysing music and cultural evolution

Cultural processes of change bear many resemblances to biological evolution. The underlying units of non-biological evolution have, however, remained elusive, especially in the domain of music. Here, we introduce a general framework to jointly identify underlying units and their associated evolutionary processes. We model musical styles and principles of organization in dimensions such as harmony and form as following an evolutionary process. Furthermore, we propose that such processes can be identified by extracting latent evolutionary signatures from musical corpora, analogously to identifying mutational signatures in genomics. These signatures provide a latent embedding for each song or musical piece. We develop a deep generative architecture for our model, which can be viewed as a type of variational autoencoder with an evolutionary prior constraining the latent space; specifically, the embeddings for each song are tied together via an energy-based prior, which encourages songs close in evolutionary space to share similar representations. As an illustration, we analyse songs from the McGill Billboard dataset. We characterize frequent chord transitions and formal repetition schemes, and identify latent evolutionary signatures related to these features. Finally, we show that the latent evolutionary representations learned by our model outperform non-evolutionary representations on tasks such as period and genre prediction.


1. Introduction
Molecular evolution involves the changes in frequency of variations in genomic sequences (alleles) in a population over time. When DNA is replicated, each nucleotide (C/G/A/T) can be replaced by another in a process known as a mutation, and evolutionary models of increasing complexity can account for different rates of transitioning from one nucleotide to another (figure 1). These mutations can be inherited and propagated through selection, genetic drift and other neutral processes [1] (see electronic supplementary material, table S1 for a glossary of key terms). Characterizing these nucleotide transitions is an important task in many fields of molecular evolution, from phylogenetics [2] and genotyping [3] to microbial evolution [4] and cancer genomics [5,6]. In cancer genomics in particular, the underlying processes that cause mutations (e.g. ultraviolet radiation, ageing, damage to DNA mismatch repair genes, etc.) can be linked to specific mutational signatures and cancer types [6-8].
In parallel to biological evolution, cultural evolution is also a theory of change, notably social change, where culture may be defined narrowly as socially learned behaviour or more broadly as any extrasomatic adaptation or representation (including, for instance, artefacts), and underlying structural and biological factors may influence the transmission and generation of behaviour [9-11]. As an early attempt at an integrated framework, 'memetics' [12] has been widely used to represent cultural information transfer, where a 'unit of culture' or 'meme' (idea, belief, behaviour, etc.) can propagate through imitation (or 'mimesis'). However, due to the lack of a stable 'code' for these memes (compared with DNA for genes), the memetics framework has also been described as lacking in explanatory power [13,14]. More recently, the framework of dual inheritance theory [15] has been proposed as a unified approach to evolution in biological and cultural domains, which avoids many of the shortcomings of the memetics framework; for instance, heredity need not be based on discrete replicators, and variation can arise from many different sources, including 'non-random' processes such as human creativity. Other recent approaches have suggested that cultural phenomena may often be more usefully modelled using non-Darwinian evolutionary processes, for instance autocatalytic networks associated with origins-of-life models [11,16]. In such models, individual units of selection and mechanisms of heritability are not predefined, but rather emerge out of the underlying network dynamics.

Figure 1. Illustration of the correspondence between latent signature models in evolutionary cancer genomics and musical evolution. (a) Biological evolution is a theory of change often signifying alterations in the genetic code of DNA nucleotide sequences. We draw parallels with music by replacing the genetic code with a sequence of notes or chords. (b) While in biological evolution we derive a mutational spectrum by considering the point mutations from an ancestral to an offspring sequence in the form of k-mers, in music we construct a transitional motif spectrum that denotes the frequency of possible chord transitions within a song, song family or category (e.g. composer, era, genre). As a biological example, we show a mutational spectrum observed in cancer, as adapted from [6]. As a musical example, we show the motif spectra of three popular songs for comparison, where chord transitions have been normalized by the tonic chord. (c) The mutational or motif spectra can, in turn, be decomposed into latent signatures in a linear or nonlinear manner, respectively. In the case of cancer, the mutational spectrum can be linearly reconstructed from its latent signatures. In the case of the music evolution model presented, a nonlinear evolutionary neural network with binary codes is used to identify the corresponding latent signatures and their activations through time. (d) Different forms of phylogenetic graph underlying the latent signatures model, with G defined by a tree (i) and temporal window function (ii). Model variables are notated in parentheses; see text (§2.2).
Music, as a cultural artefact, may be viewed as embedded in a process of cultural evolution [17]. For instance, each song or work can influence and be influenced by other songs (via harmony, form, etc.), providing an analogue of replication and inheritance. Songs are continually subjected to changing cultural tastes that can act analogously to a selective environment, while also respecting underlying functional constraints, such as harmonic syntax [18]. Further, songs within a specific genre can have a level of homology as a result of their shared lineage. The canon of twentieth-century western commercial popular music is uniquely suited to this form of evolutionary analogy. Individual 'songs' in this group are well defined: each composition has its own unique and definitive 'master recording'. Nearly all of these songs share a common syntax of scalar melodies and tertian chordal harmonies rooted in 12-tone equal temperament, and a form based on the repetition and variation of musically similar units. This common structure allows for an efficient means of identifying and comparing different works. Further, the development of popular music in the twentieth century was uniquely enabled by mechanisms of direct influence, as radio play, record sales and other means of near-perfect replication were dominant factors in popular music composition and production [19].
While the above is suggestive as a viewpoint, the question of which types of influence should count as evolutionary is an open problem, as is the extent to which concepts such as mutation and the genotype-phenotype distinction can be applied to musical artefacts [20]. Historically, different approaches have been applied to bridge the gap between music phylogenetics and biological evolution [10]. Compared with language, where evolutionary relationships can be comfortably represented as phylogenetic trees or networks [21], music evolution is still often considered a loose metaphor [22-25], where the evolutionary aspects are yet to be shown [9,26]. Recent phylogenetic analyses of Gabonese music [27] and electronic music [28] suggest that these corpora develop via both 'tree-like' vertical transmission and horizontal transmission of key musical traits. Another model, based on copies of a Renaissance-era manuscript, suggests some tree-like relationships between variants of notated representations of music [29].
Previous computational approaches to musical evolution have analysed changes in audio features [30-32], interval use [31] and song selection [33]. While previous methods have considered how characteristic variables extracted from such features change with time, such as principal component analysis (PCA) components [30], topics [34] or measures of information-theoretic complexity [32], these variables are predefined, rather than extracted by fitting an integrated model. Further, while not explicitly evolutionary in focus, other approaches have considered changes in harmonic usage over time, either through features derived from individual chords [35] or through chord transitions modelled with hidden Markov models [36]. These approaches primarily consider changes in surface features, as opposed to features learned via fitting an explicit evolutionary model.
To tackle the problem of identifying potential evolutionary processes in music, we adopt a machine learning perspective. We develop a latent evolutionary model which directly models the generation of observed musical data, such as chordal sequences from songs [30], using an underlying hidden binary code representation. We are influenced here by recent work in evolutionary genomics, which has shown that it is possible to extract signatures of mutational processes from cancer genomics data [7,8], allowing latent factors to be inferred which are responsible for a cancer's growth [6-8,37,38] (figure 1). Specifically, we introduce a model that allows us to identify underlying evolutionary 'signatures' de novo, by optimizing the reconstruction error in a variational autoencoder (VAE) framework [39], which can model arbitrarily complex generative processes using deep neural network decoders. Recent extensions of the VAE framework have considered adding extra structure to the latent space, such as graph structure [40] or cluster structure [41], to derive more interpretable representations for particular applications. In our case, we add an evolutionary structure to the latent space by incorporating an energy-based prior, which encourages songs close in evolutionary space to share similar codes. The prior thus directly embeds a notion of mutational distance between codes, while the decoder allows a complex map to be learned between codes and observed phenotypes. Hence, our model may be compared with baseline PCA and VAE models; a PCA model can be denoted $X_i \approx Z_i \theta$, where the row-vector $Z_i$ represents the component (signature) coefficients for individual (song) $i$, and $\theta$ is the PCA projection matrix, while a VAE model may be denoted $X_i \sim P(\cdot \mid Z_i, \theta) P(Z_i)$, where $P(Z_i)$ is a prior over the latent space, and $P(\cdot \mid Z_i, \theta)$ may incorporate a nonlinear mapping between the latent and observed spaces. We note that topic models such as latent Dirichlet allocation (LDA) are linear but, unlike PCA, allow for non-orthogonality between topics. Unlike PCA and LDA, and like a VAE, our model may incorporate nonlinear dependencies when mapping from signatures to the output space; unlike PCA, LDA and VAE models, ours incorporates temporal dependencies between the $Z_i$'s based on an evolutionary graph $G$, representing putative ancestral/influence relationships (referred to collectively as the 'evolutionary structure', while closeness on this graph is referred to as closeness in 'evolutionary space'), and hence has the form
$$X_{1 \ldots N} \sim \prod_{i=1 \ldots N} P(X_i \mid Z_i, \theta), \qquad Z_{1 \ldots N} \sim P(Z_{1 \ldots N} \mid G).$$
We show in our results that incorporating $G$ leads us to infer signatures that improve the log-likelihood on hold-out data. Further, electronic supplementary material, figure S1 provides a schematic comparing the structure of our model with a standard VAE model.
Conceptually, there are many points of similarity between genomic and musical data, such as their sequential nature and the existence of characteristic 'motifs' (k-mers, or repeated sequences of bases/letters/chords), which allow us to draw on many similar analytic techniques in analysing evolutionary processes in these domains (summarized in electronic supplementary material, table S2). However, we briefly note some of the dissimilarities, to avoid potential points of confusion in how our model is to be interpreted. First, we note that in the genomics case, the DNA sequence is a code (genotype) for a high-level phenotype (cell growth), and the latent signatures lie at a level below the genotype, generating patterns of mutations. By contrast, musical sequences such as chord patterns are perhaps more naturally considered phenotypes with respect to a musical work, although musical notation (e.g. sequences of notes in a score or chords in tablature notation) may also be considered a 'code' for a musical recording. In our proposed model, then, the latent signatures provide a generative code for the sequences themselves, as opposed to mutation patterns between sequences. Two comparisons are informative here. First, consider a cultural material artefact, such as a clay pot; here, the pot is the phenotype, while the genotype is naturally considered as a series of actions or instructions which generate the object, since the latter are passed on and mutate to produce different pots, in a similar way to the ideas underlying a musical work (genres, expressive goals, musical syntax). Second, consider language-based artefacts, such as novels; here, the text may be considered part of the phenotype, while the generating ideas (literary themes, character types, etc.) may be considered an underlying genotype; similar to musical works, novels do not primarily 'evolve' by authors copying previous texts and changing letters, but rather by the recombination of previous literary ideas. As a general point, we note that in the case of music and many other cultural domains, there may be no 'fact of the matter' as to which levels should be regarded as genotype and phenotype; rather, we are interested in the extent to which adopting the viewpoint sketched here can uncover cultural units (here, stylistic units) that act analogously to genetic elements in an evolutionary process. We note therefore that our model does not require sequential data (for instance, we may learn latent evolutionary signatures for images of artworks), nor does it presuppose particular mechanisms by which cultural influence operates (unlike genetic processes, cultural transmission may be important at the phenotype level [15]).
As an illustration, we analyse songs from the McGill Billboard dataset, from 533 artists across the period 1958-1991 in 64 genres [35,42]. We represent each song as a distribution over chord transitions (k-mers) and analogous transitions between formal units (representing a song's repetition structure), and apply our model to identify latent evolutionary signatures behind these distributions. We first interpret these signatures, by identifying them with features of changing harmonic syntax and formal structure. We then evaluate the representations learned by their ability to perform period and genre prediction on holdout test data (20% of the dataset). Our evolutionary model significantly outperforms other non-evolutionary models on these tasks, suggesting that the evolutionary structure is informative. We are thus able to statistically test for evolutionary structure where the units of transmission are unknown and must be inferred de novo. Finally, code and data associated with our framework can be found at: https://github.com/gersteinlab/Musevo.

2. Latent evolutionary signatures model

2.1. Extracting musical features
We begin by describing our approach to extracting harmonic features from the McGill Billboard dataset songs [35,42]. In analogy to the mutational spectra that are used to reconstruct specific signatures in cancer genomics [7,8] (figure 1), we first determine the frequency of specific chord motifs and chord transitions within each song, on which the latent evolutionary signatures will be based. The raw data consist of $N$ songs, each represented by a sequence of chords, where $a_{n,i} \in \{0 \ldots 11\}$ represents the root pitch-class of the $i$th chord (letting Ab = 0, A = 1, etc.) of the $n$th song, and $b_{n,i} \in \{0, 1\}$ represents whether the chord is major or minor (0/1, respectively). For convenience, we do not encode added 7ths/9ths, inversions or other chordal variants, and we prune the chordal sequences to remove chordal repetitions. Additionally, for each song we represent the tonality of the song by the pair $(t^{(1)}_n, t^{(2)}_n)$, where $t^{(1)}_n \in \{0 \ldots 11\}$ represents the tonic, and $t^{(2)}_n \in \{0, 1\}$ the mode (major/minor).
We generate a harmonic feature vector for each song, $X_n$, by applying a set of filters to each chord sequence. The first set of (basic) filters represents all possible chordal transitions of length $K$. Here, the $f$th filter is represented by a transition vector $[t_{f,0}, \ldots, t_{f,K-2}]$, where $t_{f,i} \in \{0 \ldots 11\}$, and a binary chord-type vector $[u_{f,0}, \ldots, u_{f,K-1}]$, where $u_{f,i} \in \{0, 1\}$. The response of filter $f$ to song $n$ at position $i$ is calculated as
$$S^{(1)}_{n,f,i} = \prod_{j=0}^{K-2} \big[ (a_{n,i+j+1} - a_{n,i+j}) \bmod 12 = t_{f,j} \big], \qquad S^{(2)}_{n,f,i} = \prod_{j=0}^{K-1} \big[ b_{n,i+j} = u_{f,j} \big].$$
Here, $S^{(1)}_{n,f,i}$ and $S^{(2)}_{n,f,i}$ are the 'scores' for filter $f$ on song $n$ at position $i$, representing agreement between the chord transitions and the major/minor chord types, respectively, and the overall response of filter $f$ to song $n$ aggregates $S^{(1)}_{n,f,i} S^{(2)}_{n,f,i}$ over positions $i$. Further, $[\cdot]$ is the Iverson bracket, which is 1 for a true proposition and 0 otherwise. Since there are $12^{K-1}$ transition vectors and $2^K$ chord-type vectors, there are $12^{K-1} \cdot 2^K$ possible basic filters in total.
We also consider a second set of filters, which are normalized to the tonic/tonality of each song (τ-normalized), hence transposing each song to an arbitrary common reference. Here, the $g$th filter is again represented by a transition vector $[t_{g,0}, \ldots, t_{g,K-1}]$, where $t_{g,i} \in \{0 \ldots 11\}$, and a binary chord-type vector $[u_{g,0}, \ldots, u_{g,K}]$, where $u_{g,i} \in \{0, 1\}$, but here the transition vector represents the offset of each chord root relative to the tonic. Additionally, an extra bit is added to $u_g$ to represent the key (mode) of the song. The response of tonality-normalized filter $g$ to song $n$ at position $i$ is calculated analogously as
$$S^{(1)}_{n,g,i} = \prod_{j=0}^{K-1} \big[ (a_{n,i+j} - t^{(1)}_n) \bmod 12 = t_{g,j} \big], \qquad S^{(2)}_{n,g,i} = \big[ t^{(2)}_n = u_{g,K} \big] \prod_{j=0}^{K-1} \big[ b_{n,i+j} = u_{g,j} \big].$$
For the tonality-normalized filters, there are $12^K$ transition vectors and $2^{K+1}$ chord-type vectors, leading to $12^K \cdot 2^{K+1}$ possible τ-normalized filters in total.
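To make the filter construction concrete, the following minimal Python sketch computes k-mer spectra of the two kinds described above. The (root, type) tuple encoding, function names and example song are our own illustrative choices, not taken from the released codebase.

```python
# Minimal sketch of the chord k-mer features described above. Chords are
# (root, type) pairs with root in 0..11 (Ab = 0, A = 1, ..., so C = 4) and
# type in {0: major, 1: minor}; tonality is a (tonic, mode) pair. All names
# here are illustrative, not taken from the released codebase.
from collections import Counter

def dedupe(chords):
    """Prune immediate chordal repetitions, as described in section 2.1."""
    if not chords:
        return []
    out = [chords[0]]
    for c in chords[1:]:
        if c != out[-1]:
            out.append(c)
    return out

def kmer_spectrum(chords, K, tonality=None):
    """Count length-K chord motifs. Basic filters record root *transitions*
    (mod 12) plus chord types; tau-normalized filters record root offsets
    from the tonic plus chord types and the key's mode."""
    chords = dedupe(chords)
    counts = Counter()
    for i in range(len(chords) - K + 1):
        roots = [a for a, _ in chords[i:i + K]]
        types = tuple(b for _, b in chords[i:i + K])
        if tonality is None:
            trans = tuple((roots[j + 1] - roots[j]) % 12 for j in range(K - 1))
            counts[(trans, types)] += 1          # 12^(K-1) * 2^K possibilities
        else:
            tonic, mode = tonality
            offsets = tuple((r - tonic) % 12 for r in roots)
            counts[(offsets, types, mode)] += 1  # 12^K * 2^(K+1) possibilities
    return counts

# Example: C - F - G - C in C major (Ab = 0 encoding).
song = [(4, 0), (9, 0), (11, 0), (4, 0)]
print(kmer_spectrum(song, K=2, tonality=(4, 0)))
```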
Both the basic filters and the τ-normalized filters described above can be represented using an indexing scheme for pairs of chords based on mod-12 arithmetic.We describe this indexing scheme in more detail in §3.1, and illustrate it in figure 2.

2.2. Model formulation
We now describe our model for latent evolutionary signatures, and briefly outline our training algorithm for the model. The model requires as input a set of observed training data vectors $X_{n=1 \ldots N}$, for which we use the harmonic features defined in the previous subsection (although other types of features may also be used). In addition, we require a weighted graph over the training examples, $G$, which is a positive symmetric matrix of size $N \times N$, where $G(i, j)$ represents the closeness in evolutionary space of training samples $i$ and $j$. For our current investigation, we predefine $G$ by ordering the songs in the training set by date, and connecting each song to all other songs in an overlapping temporal window of size $w$ on either side (figure 1d). The evolutionary model then fits a latent code to each song, $Z_n \in \{0, 1\}^B$, where $B$ is the number of bits in the latent code vectors, corresponding to the number of evolutionary signatures. Further, the model fits a neural network which provides a generative model of the observed feature vectors $X_n$ from the latent codes. The likelihood of the model combines a reconstruction loss with a prior over the codes, which penalizes large changes in the latent vectors between strongly related songs according to $G$:
$$P(X, Z) = P(X \mid Z)\,P(Z), \qquad P(X \mid Z) = \prod_{i=1}^{N} P_{\mathrm{gen}}(X_i \mid \mathrm{NN}(Z_i; \theta)),$$
$$P(Z) \propto \exp(-E(Z)), \qquad E(Z) = \gamma \sum_{i<j} G(i, j)\, d_H(Z_i, Z_j). \qquad (2.3)$$
Here, $\mathrm{NN}(Z_i; \theta)$ is a neural network parametrized by $\theta$, $P_{\mathrm{gen}}(\cdot \mid \cdot)$ is the generative model for the observed data (for instance, a Bernoulli model if the features of $X_i$ are thresholded at 1, and the output of the neural network represents the probability that each feature is 1), and $E(Z)$ is an energy model defined over the latent vectors, which penalizes pairs of latent vectors by the product of their closeness in the underlying evolutionary graph and their Hamming distance $d_H$, weighted by $\gamma$. This form of $E(Z)$ is motivated by considering a point-wise mutational process acting on code vectors between related songs. Further, we note that, while we have predefined $G$, a prior may alternatively be placed over $G$ (for instance, enforcing that $G$ has a tree or directed acyclic graph (DAG) structure which respects temporal ordering, figure 1), and it may be considered an additional parameter of the model likelihood.
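As a concrete illustration of the prior, the sketch below constructs the temporal-window graph G and evaluates the energy E(Z) for a set of binary codes. This is a minimal numpy sketch under the definitions stated above, not the released implementation.

```python
# Minimal numpy sketch (not the released implementation) of the evolutionary
# graph G and the energy-based prior over binary codes defined in (2.3).
import numpy as np

def temporal_window_graph(n_songs, w):
    """G(i, j) = 1 if songs i and j (ordered by date) lie within a temporal
    window of size w of each other; symmetric, with zero diagonal."""
    idx = np.arange(n_songs)
    G = (np.abs(idx[:, None] - idx[None, :]) <= w).astype(float)
    np.fill_diagonal(G, 0.0)
    return G

def energy(Z, G, gamma):
    """E(Z) = gamma * sum_{i<j} G(i, j) * d_H(Z_i, Z_j), so that the prior
    P(Z) ~ exp(-E(Z)) favours similar codes for evolutionarily close songs."""
    d_H = np.abs(Z[:, None, :] - Z[None, :, :]).sum(-1)  # N x N Hamming matrix
    return gamma * np.triu(G * d_H, k=1).sum()

Z = np.random.randint(0, 2, size=(10, 5))  # 10 songs, B = 5 signatures
print(energy(Z, temporal_window_graph(10, w=5), gamma=1.0))
```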
We fit the model in equation (2.3) by optimizing the evidence lower bound (ELBO) on the likelihood [39], while using a mean-field approximation as our variational distribution over the latent space [43]. For our model, the ELBO has the form
$$\log P(X) \geq \mathbb{E}_{Q(Z)}[\log P(X \mid Z)] - KL(Q(Z) \,\|\, P(Z)), \qquad (2.4)$$
where $Q(Z)$ is the mean-field distribution across the latent space, and $KL(\cdot \| \cdot)$ is the Kullback-Leibler (KL) divergence. For convenience, we assume that $B$ is small enough that $Q$ may be represented by an explicit discrete distribution across all code-vectors, with each song treated independently, $Q(Z) = \prod_i Q(Z_i)$, where $Q(Z_i = \beta) = q_{i,\beta}$ for each code-vector $\beta \in \{0, 1\}^B$. The bound in equation (2.4) can be optimized by variational Bayes expectation-maximization (VBEM) [44]. For the E-step, this requires optimizing the local mean-field parameters $q_{i,\beta}$. Using standard mean-field results [43], this results in the local updates
$$q_{i,\beta} \propto P_{\mathrm{gen}}(X_i \mid \beta) \exp\Big(-\gamma \sum_{j \neq i} G(i, j) \sum_{\beta'} q_{j,\beta'}\, d_H(\beta, \beta')\Big),$$
where $P_{\mathrm{gen}}(X_i \mid \beta) = P_{\mathrm{gen}}(X_i \mid \mathrm{NN}(\beta; \theta))$. For the M-step, code-vectors may be sampled for each song from $Q(Z_i)$, and $\theta$ optimized by gradient descent to maximize $\log P_{\mathrm{gen}}(X_i \mid \mathrm{NN}(Z_i; \theta))$. Additionally, for visualization we calculate two sets of quantities. The first represents the activation of signature $i$ at time $t$, where a normalizing factor centres the activations at 1. The second, denoted $w_{i,f} = \log \theta_{i,f}$, represents the log weight of feature $f$ given that signature $i$ is activated, where the model is assumed to be a one-layer neural network.
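The following sketch illustrates the mean-field E-step update above, assuming B is small enough to enumerate all 2^B codes explicitly (as in the paper); the logliks input stands in for precomputed values of log P_gen(X_i | NN(β; θ)), and all names are our own.

```python
# Hedged sketch of the mean-field E-step above, assuming all 2^B codes can
# be enumerated explicitly; `logliks` stands in for precomputed values of
# log P_gen(X_i | NN(beta; theta)).
import numpy as np
from itertools import product

def e_step(logliks, G, gamma, n_sweeps=3):
    """logliks: N x 2^B array, one entry per song and code-vector beta.
    Returns Q: N x 2^B mean-field posteriors q_{i,beta}."""
    N, n_codes = logliks.shape
    B = int(np.log2(n_codes))
    codes = np.array(list(product([0, 1], repeat=B)))            # 2^B x B
    d_H = np.abs(codes[:, None, :] - codes[None, :, :]).sum(-1)  # 2^B x 2^B
    Q = np.full((N, n_codes), 1.0 / n_codes)
    for _ in range(n_sweeps):
        for i in range(N):
            # expected Hamming distance to the neighbours' codes under Q
            penalty = gamma * (G[i] @ Q) @ d_H
            logq = logliks[i] - penalty
            Q[i] = np.exp(logq - logq.max())
            Q[i] /= Q[i].sum()
    return Q
```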

2.3. Period and genre prediction
For the period and genre prediction tasks, we split the data into training and test partitions, $X_{\mathrm{train}}, X_{\mathrm{test}}$, where $X_{\mathrm{test}}$ is formed by sampling every fifth song in chronological order. We first train the evolutionary signatures model on $X_{\mathrm{train}}$ using the approach in §2.2. We then infer the maximum a posteriori latent representation for each test song independently, using the marginal distribution across the training latent codes as a prior. Hence, for test song $j$, we find
$$\hat{Z}_j = \arg\max_{\beta} P_{\mathrm{gen}}(X_j \mid \beta)\, P(\beta), \qquad \text{where } P(\beta) = \frac{1}{N} \sum_{i=1}^{N} q_{i,\beta}.$$
We then perform period prediction using a kernel-regression approach. Hence, for the $i$th training song in chronological order, we specify the label $y_{\mathrm{train},i} = i/N$. For test song $j$, we then predict
$$\hat{y}_j = \frac{\sum_i y_{\mathrm{train},i} \exp(-\alpha\, d_H(\hat{Z}_j, Z_i))}{\sum_i \exp(-\alpha\, d_H(\hat{Z}_j, Z_i))}, \qquad (2.9)$$
where $d_H(\cdot, \cdot)$ is the Hamming distance, and $\alpha$ is set by cross-validation as a hyper-parameter. To assess performance, we calculate the Pearson correlation coefficient between $\hat{y}_j$ and the ground-truth test labels, $y_{\mathrm{test},j} = j/N_{\mathrm{test}}$.
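A minimal sketch of the kernel-regression predictor of equation (2.9), using a Hamming-distance kernel over binary codes, is given below; the array names and random example data are illustrative assumptions.

```python
# Minimal sketch of the kernel-regression predictor of equation (2.9); the
# array names and the random example data are illustrative assumptions.
import numpy as np

def kernel_predict(Z_train, y_train, Z_test, alpha):
    """Weights decay exponentially with the Hamming distance between the
    test and training codes; alpha is set by cross-validation."""
    d = np.abs(Z_test[:, None, :] - Z_train[None, :, :]).sum(-1)  # M x N
    W = np.exp(-alpha * d)
    return (W @ y_train) / W.sum(axis=1)

rng = np.random.default_rng(0)
Z_tr, Z_te = rng.integers(0, 2, (584, 5)), rng.integers(0, 2, (146, 5))
y_tr = np.arange(584) / 584          # chronological labels y_i = i / N
print(kernel_predict(Z_tr, y_tr, Z_te, alpha=1.0)[:3])
```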
For genre prediction, each song may be assigned to multiple genres (rock, jazz, pop, etc.). We treat assignment to each genre as a separate binary classification task. For a given genre, we assign labels $y_{\mathrm{train},i} \in \{0, 1\}$ for songs belonging versus not belonging to the genre, respectively, and apply equation (2.9) to estimate a score for a test song with respect to the genre. Similarly, we assess performance by the Pearson correlation coefficient between the test scores and the ground-truth binary labels, and take the average Pearson correlation across genres as a measure of genre prediction performance.

2.4. Incorporating formal features
We further explore incorporating formal features into the model formulation above. For these, we assume that we have $C$ formal categories, with each song being represented by a vector $F_n$ of length $l_{\mathrm{form}}(n)$, hence $F_n = [c_{n,0}, \ldots, c_{n,l_{\mathrm{form}}(n)-1}]$, where $c_{n,m} \in \{0 \ldots C\}$. As with the harmonic features, we define a set of filters for the formal features. The $f$th filter is represented by a vector $[d_{f,0}, \ldots, d_{f,K-1}]$, where $d_{f,k} \in \{0 \ldots C\}$. Then, the response of filter $f$ to song $n$ at position $i$ is calculated as
$$S_{n,f,i} = \prod_{j=0}^{K-1} \big[ c_{n,i+j} = d_{f,j} \big].$$
We test multiple methods for defining the formal categories $0 \ldots C$ below, including (i) using the predefined categories in the Montreal dataset, (ii) manually coarse-graining these into a smaller number of categories by hand-coding a projection matrix to merge semantically similar categories (e.g. outro/fade-out), and (iii) defining a projection matrix which projects the $M$ smallest categories onto an 'other' label, where $M$ is set so that this label accounts for not more than 0.01 of the total observed labels. Further, we explore the impact of combining both harmonic and formal features by concatenating their feature vectors, i.e. $X_n = [X_n^{(\mathrm{harm})}, X_n^{(\mathrm{form})}]$.
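As an illustration of option (iii), the following sketch maps the rarest formal categories to a shared 'other' label, subject to the 0.01 budget described above; the function name and interface are our own assumptions.

```python
# Illustrative sketch of option (iii): map the rarest formal categories to a
# shared 'other' label, keeping 'other' at no more than a fraction max_frac
# of all observed labels. The function name and interface are assumptions.
from collections import Counter

def coarse_grain(label_seqs, max_frac=0.01):
    counts = Counter(c for seq in label_seqs for c in seq)
    total = sum(counts.values())
    mapping, merged = {}, 0
    for label, n in sorted(counts.items(), key=lambda kv: kv[1]):
        if merged + n <= max_frac * total:   # absorb the M smallest categories
            mapping[label], merged = 'other', merged + n
        else:
            mapping[label] = label
    return [[mapping[c] for c in seq] for seq in label_seqs]

sections = [['intro', 'verse', 'chorus', 'verse', 'chorus', 'fadeout'],
            ['verse', 'chorus', 'bridge', 'chorus']]
print(coarse_grain(sections, max_frac=0.2))
```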

3. Results

3.1. McGill Billboard dataset
We first describe the initial processing we apply to the McGill Billboard dataset [35,42]. We remove all duplicate songs, leaving us with 730 songs in total. We order these by date, and take every fifth song to form a testing set, thus splitting the data into a training and testing set of 584 and 146 songs, respectively, each containing songs evenly distributed across the period 1958-1991 (see electronic supplementary material, figure S2 for a schematic summary of the dataset statistics). We calculate basic and τ-normalized harmonic features for each song, as described in §2.1, for K = 1, 2, 3, 4. To create a compressed feature vector, we use only filters which are non-zero in at least 100 songs to form the data matrix, X. For implementation purposes, we further encode the filters described in §2.1 using a mod-24 encoding scheme, illustrated in figure 2a and described in more detail in the electronic supplementary material, methods. To gain a preliminary overview of the dataset statistics, we use this indexing scheme to quantify prevalent 3- and 4-chord motifs (figure 2b), and to construct motif spectra for each song (figure 2c). Finally, we decompose every song's indexed residuals (as obtained from the chord transitions) using singular value decomposition (SVD) to identify shared similarities between songs and chord transitions. Figure 2d shows the correspondence analysis between principal components PC2 and PC8, while a heatmap of the indexed residuals for all songs is shown in electronic supplementary material, figure S3.
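The preprocessing steps above (pruning rare filters and holding out every fifth song chronologically) can be sketched as follows; the exact offset of the held-out songs and the names used are illustrative assumptions.

```python
# Sketch of the preprocessing described above: prune filters active in fewer
# than 100 songs, then hold out every fifth song in chronological order. The
# offset of the held-out songs and the names are illustrative assumptions.
import numpy as np

def split_and_prune(X, min_support=100, hold_out_every=5):
    """X: N x F feature matrix, with rows (songs) ordered by date."""
    keep = (X != 0).sum(axis=0) >= min_support      # filter support threshold
    X = X[:, keep]
    is_test = np.arange(X.shape[0]) % hold_out_every == hold_out_every - 1
    return X[~is_test], X[is_test]                  # e.g. 730 -> 584 + 146
```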

3.2. Evolutionary signature interpretation
We train models of latent evolutionary signatures (§2.2) using three types of harmonic features: (i) K = 1 τ-normalized filters, corresponding to chord usage normalized by tonality; (ii) K = 4 basic filters, corresponding to transition sequences between groups of four chords, not normalized by tonality; and (iii) K = 4 τ-normalized filters, which are as in (ii), but normalized by tonality. In each case, we fix the number of latent signatures to be 5 (B = 5), and use a single-layer neural network to aid interpretability, whose output is a vector of Bernoulli probabilities predicting whether each filter gives a non-zero response in a song (see §2.2). These parameters are chosen to ensure tractable model training (approx. 10 min per model), while permitting the discovery of meaningful musical units, given the prevalence of 4-bar phrase structure in the songs considered. We train all models for 10 epochs of VBEM, taking 20 steps of gradient descent within each M-step to optimize the neural network parameters θ. We monitor the ELBO bound on the training log-likelihood and the Pearson correlation (r) for period and genre prediction (discussed below) on the test set at each epoch, and set γ = 1 and w = 5 in equation (2.3) by cross-validation on r for period prediction (optimizing over γ ∈ {0.1, 1, 10} and w ∈ {1, 2, 5}).
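For intuition, the compact sketch below puts the E- and M-steps together in the VBEM loop described above (10 epochs, 20 gradient steps per M-step), using a one-layer sigmoid decoder with a Bernoulli likelihood on binarized features. It reuses the e_step helper sketched in §2.2 and is illustrative rather than the released implementation.

```python
# Compact, illustrative sketch of the VBEM training loop described above,
# with a one-layer sigmoid decoder and a Bernoulli likelihood on binarized
# features X; reuses the e_step helper sketched in section 2.2.
import numpy as np
from itertools import product

def train(X, G, B=5, gamma=1.0, epochs=10, m_steps=20, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    N, F = X.shape
    codes = np.array(list(product([0, 1], repeat=B)), dtype=float)  # 2^B x B
    W = 0.01 * rng.standard_normal((B, F))                          # theta
    for _ in range(epochs):
        # E-step: Bernoulli log-likelihood of each song under every code
        P = np.clip(1 / (1 + np.exp(-codes @ W)), 1e-6, 1 - 1e-6)
        logliks = (np.log(P) @ X.T + np.log(1 - P) @ (1 - X).T).T   # N x 2^B
        Q = e_step(logliks, G, gamma)
        # M-step: sample a code per song from Q, then ascend the likelihood
        Z = codes[[rng.choice(len(codes), p=Q[i]) for i in range(N)]]
        for _ in range(m_steps):
            P = 1 / (1 + np.exp(-Z @ W))
            W += lr * Z.T @ (X - P) / N     # gradient of Bernoulli log-lik
    return Q, W
```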
Figure 3 shows the output of the training for these three models. The ELBO increases monotonically during training and begins to plateau at 10 epochs for all models (electronic supplementary material, figure S4). Figure 3a shows the signature activations for each song arranged chronologically, corresponding to Q(Z_{i,j} = 1), the probability that signature j is 1 in the ith song under the variational posterior (normalized by the mean per-signature activation). All three models exhibit evolutionary signatures with net positive and negative trends over time, as well as signatures with peaks at specific time periods (electronic supplementary material, figure S5 additionally provides a bootstrap-like stability analysis, showing that signatures with broadly similar trends are learned across multiple repetitions of the model training with different initializations and dataset sizes). Figure 3b,c provides further viewpoints on the models to help interpret the signatures learned. In figure 3b, the log of the output weights over the K = 1 harmonic features is plotted when each signature is turned on in turn. Since the K = 1 τ-normalized model uses single chords, there are only 48 possible features, corresponding to the 24 major/minor chords at each possible offset from the tonic, under major and minor tonalities. The major tonality output weights are visualized (the minor tonality weights showed little variation between signatures, possibly due to a lack of training data in minor keys). A number of prominent features can be observed, such as the primarily diatonic distributions in signature 1, the up-weighting of major chords in signature 4, and the emphasis on bVI and bVII chords in signatures 3 and 5, with a de-emphasis on the dominant (7) in the latter. Figure 3c summarizes the chordal sequences that receive the largest weights in the K = 4 models for each signature, where the non-normalized sequences are notated to begin on C or Cm, and the τ-normalized sequences are notated relative to a tonic on C. The table shows the top three sequences from each signature, after removing sequences which are rotationally equivalent. The full ranking for each sequence is provided in electronic supplementary material, figures S6-S7. Prominent features include signatures emphasizing only primary chords (I, IV, V), signatures involving mixed major and minor chords, and signatures involving whole-tone alternations and/or b7 emphasis (C, Bb, C, Bb). A further notable feature of the K = 4 τ-normalized model in figure 3a is the marked decrease in the signatures' variance over time. Indeed, when compared with models trained on randomly permuted temporal orderings, this trend is found to be significant (electronic supplementary material, figure S8), pointing towards a possible 'homogenization' trend in the 4-mer sequences used over this period.

3.3. Comparing models on prediction tasks
We use the kernel-regression approach of §2.3 to compare a variety of architectures for the evolutionary signatures model, as well as to compare the value of the signatures learned against other latent representations and approaches to performing these tasks. In each case, we plot the Pearson correlation of the predicted periods and genres on the holdout test data, as described in §2.3. For convenience, we compare all models using the K = 1 τ-normalized harmonic features as inputs, while varying the size of the latent space and the number of layers in the neural network (NN) generative model (θ).
Figure 4a shows that performance on both prediction tasks increases during training, although model performance is not necessarily optimized for both tasks at the same training epoch (training shown for B = 5, 1-layer NN). Figure 4b further shows that, on both tasks, the evolutionary signatures model outperforms predictions using the latent representations learned by a VAE with the same-sized latent space. Further, the plots suggest that approximately six latent signatures are optimal for generalization in the period prediction task, while more signatures are beneficial for genre classification. We test a maximum of B = 7 latent signatures, to allow us to perform exact optimization of Q(Z_i) over all configurations in the latent space for a given song, although using a Q(Z_i) which factors over both songs and signatures would allow us to test larger models. Figure 4c then tests the performance of the evolutionary signatures and VAE models when the number of layers in the generative NN is varied (fixing B = 5 for both); as shown, the evolutionary signatures learned are consistently more informative than the VAE latent representations (p = 3 × 10^−5, sign-test across all model comparisons). Further, we tested the performance of the kernel-regression predictor when applied directly to the raw features, for comparison with the models in figure 4. This gave r = 0.220 and 0.217 for period and genre prediction, respectively, which are substantially lower than r = 0.275 and 0.243 for the best-performing evolutionary signatures models in figure 4 (p = 2 × 10^−6 and p = 1 × 10^−5 for paired t-tests on the period and genre tasks, using per-instance squared error and cross-entropy, respectively, the latter corresponding to an increase in test classification accuracy from 0.609 to 0.627).

3.4. Comparing models with combinations of formal and harmonic features
Finally, we compare models including combinations of formal and harmonic features. Figure 5a,b shows that, although formal features alone are less informative about the period of a song than harmonic features, in combination they produce optimal performance. Further, we note several putative patterns in the temporal evolution of the combined signatures learned in figure 5c,d: signature 1 (red), which has a strong diatonic profile (see, for instance, ABBA's 'Take a Chance on Me', 1977), is associated with a broad distribution of formal features, and is highest at the start of the period; signatures 2 (green) and 4 (magenta), in contrast, appear to be anticorrelated, and oscillate a number of times; signature 2 appears to be associated with 'compressed' song structure, via a low weighting of many formal features, while signature 4 appears to be associated with more expansive song structures (including high weights for repeated sections such as 'chorus-chorus(-silence)' and optional sections such as 'bridge'; see, for instance, Pat Benatar's 'Love is a Battlefield', 1983), as well as non-diatonic harmonic features, suggesting modulation (see, for instance, Queen's 'Bohemian Rhapsody', 1975; a full listing of the latent signature activations for each song is given in the electronic supplementary material). Figure 5e then shows that, in general, coarse-graining of the formal categories helps performance, although the precise form of projection and normalization that achieves optimal performance varies with the model.

4. Discussion
We have argued that the evolutionary processes underlying developments in musical styles, syntax and genres may be identified by explicitly incorporating evolutionary constraints into a generative model of a musical corpus. We are motivated by recent techniques in evolutionary genomics, which have identified 'mutational signatures' that underlie evolutionary processes in the context of cancer genomics. For this purpose, we propose a model of latent evolutionary signatures, which learns a latent 'code' for each song in a corpus, and jointly optimizes the codes to reconstruct the observed musical features and respect an underlying evolutionary structure. We note, however, two differences between our approach and mutational signature models in genomics: first, mutational signatures are not explicitly optimized to incorporate evolutionary constraints ([7,8] use PCA components); second, mutational processes generate variation in genotypes, rather than acting directly as 'codes' themselves (although they may be viewed as latent factors in a two-level evolutionary process, as discussed in [45]). We do not claim that musical novelty is a chance process in the same way that genetic mutation is. Rather, we have used the term 'mutation' in this context to indicate any number of compositional decisions and events that may result in the change of musical norms over time, or from song to song. Finally, we trained our model on a variety of harmonic and formal features extracted from the McGill Billboard dataset, and showed that incorporating latent evolutionary structure led to features that were more informative (than those of matched VAE models) for the tasks of period and genre prediction.
To a large extent, our model may be viewed as an 'initial attempt' to fit evolutionary models to a musical corpus de novo, and we acknowledge that several aspects of our model are oversimplified. For convenience, our model adopts a memetic-based evolutionary architecture, where signatures are represented as discrete binary units and variation is introduced by a simple mutation rate which flips the units on/off with a fixed probability. Moreover, we do not explicitly model fitness. As noted in the introduction, these choices also emphasize the correspondence with genetic-based evolutionary models of cancer. However, we do not consider the memetic viewpoint to be essential to our approach. In this respect, our model could be straightforwardly adapted to integrate concepts from dual inheritance theory [15]. For instance, the latent signatures may be represented in a continuous rather than discrete space, hence avoiding the assumption of an underlying set of discrete memes, and the modes of variation may be learned rather than assumed to be random (for instance, reflecting psychological biases in the types of variation introduced). Some models of creativity suggest that variation is a product of complex dynamics in mental representation, akin to the process of insight [46], and multiple formal models of creativity exist [46-49] which could provide more accurate models of variation than pointwise mutations. Different types of imitation could also be introduced, ranging from direct modelling to more diffuse influence, and more complex models of fitness could be incorporated, allowing different genres to maximize different desiderata, or modelling frequency-dependent selection effects caused by novelty preference [50]. Such (important) considerations are left for future work, and we conclude by discussing some of them, as well as the implications of our models, in more detail.

4.1. Musicological interpretation
Several features of the evolutionary signatures we extract are consistent with previous analyses of the McGill Billboard dataset. The conclusion in [36] that IV (normalized F) chords are the most common chords leading into or out of I (normalized C) is supported in figure 3c, where the normalized 2-mer C-F or F-C appears in four out of five signatures. The decrease over time of our signature 4 from figure 3b is consistent with the observation in [42] that dominant chords decrease in frequency relative to subdominant and tonic chords. The same trend in signature 4 reflects the conclusion in [42] that minor chord usage increases with time over the dataset, and the increase over time of our signatures 3 and 5 from figure 3b, and signatures 4 and 5 from figure 3c, for the non-normalized and τ-normalized models respectively, corresponds to the observation in [36] that bVII becomes increasingly important, taking on the role of a substitute dominant in later periods. The decrease with time of our signature 1 from figure 3b, corresponding to traditional diatonic harmony, is also consistent with the increased use of modal harmonies and transitions noted in [36]. In our model, these trends are tied to evolutionary signatures which link features that act together (for instance, bVII is tied to distinct signatures (3 and 5) in figure 3b, the former retaining a dominant emphasis and containing a moderate weight on bVI, while the latter reverses this weighting). Where the model in [36] had trouble accounting for differences in style across the diverse corpus, ours implicitly incorporates stylistic diversity and characterizes changes in the distribution of stylistic effects over time. Further, our combined analysis of harmonic and formal features suggests some ways in which these harmonic trends may be linked with song structure. As noted in §3.4, we observed a diatonic/complex-structure joint signature which appears to decrease with time, along with anticorrelated compressed and expansive formal signatures, the latter linked to possibly modulatory features.

4.2. Deep evolutionary modelling
As noted, the influence of musical (and cultural) entities on one another may be highly varied. For an evolutionary process to take shape, though, there must be some regularity in the transmission of 'cultural units' (which, as discussed, may be discrete, as in the case of 'memes', or continuous latent variables). The problem of identifying these units/variables a priori may be viewed as an intractable task [9,23,24,26]. Our model demonstrates that, instead, such features can be identified statistically through the process of fitting an evolutionary model to observed data. In this way, we do not need to explicitly define the units underlying the process, but may discover them de novo through model fitting. In general, we do not expect these units to be directly observable as surface features of the phenotypes in question. We thus focus on discovering underlying deep memes or deep units of transmission, which live in a latent space (either continuous or discrete, where the latter may also be viewed as a code), and whose relationship to the observed phenotypes may in general be highly complex, modelled by an arbitrary neural network. In this way, the meme-phenotype relationship (or latent unit to phenotype relationship) is somewhat analogous to the gene-phenotype relationship, both involving a complex generative process. As we described in §2, a latent evolutionary process may thus be formulated as a deep latent model, whose latent variables respect evolutionary relationships defined by an underlying ancestral graph between the modelled entities (which may itself be discovered through similarity). The degree to which the deep memes or units of transmission discovered by fitting models of this kind are explanatory may be assessed by statistically testing the model fit against matched models which do not include evolutionary structure. Such techniques may be further explored in other cultural and biological contexts, for instance language evolution and cancer mutational processes. Further, while our framework stresses a particular interpretation of how the levels of 'genotype' and 'phenotype' map onto the analytic levels, distinct from the cancer evolution case (compared in electronic supplementary material, table S2), we stress that the analytic techniques can be applied irrespective of the evolutionary interpretation adopted, providing a general model for learning nonlinear signatures constrained by a model of temporal variation.

4.3. Characterizing musical change as an evolutionary process
As discussed, to define a musical evolutionary process, we require a relationship corresponding to the ancestral or parental relationship in biological evolution. In our model, we propose that the analogous relationship is one of influence (or mimesis, see [12,51]), which may be defined in information-theoretic terms [45]. The nature of the influence may be highly varied, and less precise than in the biological context; for instance, some parent songs have merely a stylistic influence on their offspring, while others influence particular motifs and phrases. Further, such a web of influences may generate a neutral or an adaptive evolutionary process. Formally, an adaptive evolutionary process requires there to be a dependency of the number of offspring on a heritable phenotype of an individual (either a single phenotype or a combination of phenotypes). By contrast, a neutral evolutionary process may still exhibit changes in the phenotypes of individuals, but these are due to statistical sampling effects (drift) rather than systematic dependencies between phenotype and number of offspring. In the context of musical evolution, such a definition translates to the net influence of a particular song or work on others. The types of dependency on phenotype that may be relevant here include factors such as cultural tastes, memorability and emotional salience or valence. Our latent evolutionary framework may be naturally extended to further distinguish between different types of evolution (neutral versus adaptive) in music and other cultural spheres, for example by introducing a fitness parameter, or by fitting a two-level process to identify genre-level effects [45]. Although the possibilities above can be naturally modelled in our framework, it is more challenging to integrate non-Darwinian aspects of cultural evolution, such as those suggested by the 'honing theory' of creativity, whereby cultural phenomena may be modelled by autocatalytic networks in which individual units of selection and mechanisms of heritability are not stable features of the process but rather emergent [46]. The incorporation of techniques from information theory which allow a fuzzy definition of individuality may ultimately be necessary to accurately capture such features of cultural evolution [52].

4.4. Further extensions and applications
We finally note some further extensions to our model. First, although we focus on harmonic and formal features in our analysis, the model may be applied to any musical domain (for instance, melody or rhythm), as well as to cross-domain analysis and the incorporation of different sequence models (for instance, using filters with a larger range of values of K in combination with a sparsity prior, or using a transformer-based sequence model). We expect that domains will differ in relative analytic importance depending on a researcher's choice of musical tradition or corpus. In addition, our approach may be extended so that the underlying graph G is learned jointly with the other model parameters. Optimizing G provides a means of detecting influences between songs, and of learning fine-grained evolutionary signatures that reflect specific influences (as opposed to using the coarse approximation of influence due to temporal proximity). Such a model also provides a context for exploring features which are novel to particular songs and artists, in the context of evolutionary innovation (through mutation and neutral processes, see [53]). Finally, as we show, the latent representations learned by our model may be used for many tasks of interest. A particularly interesting application is recommender systems [33], where an individual's taste provides a selective environment in which a predictive evolutionary system may learn and interact.

Figure 2. Quantifying chords/transitions to construct k-mer spectra and perform correspondence analysis with singular value decomposition (SVD). (a) Example data from the Montreal dataset; chord symbols and formal units are marked in blue and yellow, respectively. (b) The indexing scheme, (c) motif frequencies and spectra, and (d) songs plotted in SVD space. See §3.1.

Figure 3. Evolutionary signature interpretation. (a) The normalized evolutionary signature activations over time for the signatures learned; (b,c) provide a summary of the features prioritized by the three models (chords and sequences). See §3.2 for details.

Figure 4. Evolutionary signature (Ev-sigs) models outperform VAE models in period and genre prediction. (a) The evolutionary signature model's period (red) and genre (blue) prediction performance across training epochs. (b) Comparison of evolutionary signature and VAE models' performance with different numbers of latent signatures. (c) Comparison of these models in terms of period (red) and genre (blue) prediction performance across neural networks of different depths. See §3.3.

Figure 5. Comparing performance of models trained on combinations of formal and harmonic features (period prediction). (a) Performance during training (r ≥ 0). (b) Comparison of the performance of optimal models. (c) The normalized evolutionary signature activations over time for the optimal form + harmony model. (d) A summary of the feature weightings per signature in this model, with suggested genre/descriptive annotations (signatures 1-5 correspond to the colours red, green, blue, magenta and cyan, respectively). (e) Performance of models with formal features using different coarse-grainings of formal categories (projection matrices) and normalizations.